Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) the out-of-view sound components can be created while wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module to handle the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to tackle the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally available audio-visual temporal synchronization as the “free” self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes.
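The abstract describes the geometry-aware temporal aggregation only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: per-frame geometric transforms (assumed here to be 3x3 homographies, e.g. estimated from keypoint matching) warp each frame's visual feature map into a shared reference view before temporal aggregation, so that egomotion does not smear the aggregated features. All names (homography_to_grid, aggregate_with_egomotion) are hypothetical.

import torch
import torch.nn.functional as F

def homography_to_grid(H, height, width):
    # Build a normalized sampling grid that applies homography H
    # (3x3, in pixel coordinates) to every output location.
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    coords = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)  # (H*W, 3)
    warped = coords @ H.T                                        # (H*W, 3)
    warped = warped[:, :2] / warped[:, 2:].clamp(min=1e-8)       # perspective divide
    # Normalize to [-1, 1] as expected by grid_sample (x = width axis, y = height axis).
    x = warped[:, 0] / (width - 1) * 2 - 1
    y = warped[:, 1] / (height - 1) * 2 - 1
    return torch.stack([x, y], dim=-1).reshape(1, height, width, 2)

def aggregate_with_egomotion(feats, homographies):
    # feats: (T, C, H, W) per-frame visual feature maps.
    # homographies: list of T 3x3 tensors mapping reference-frame pixel
    # coordinates to frame-t pixel coordinates (identity for the reference frame).
    # Returns an egomotion-compensated feature map of shape (C, H, W).
    T, C, H, W = feats.shape
    warped = []
    for t in range(T):
        grid = homography_to_grid(homographies[t], H, W)
        warped.append(F.grid_sample(feats[t:t+1], grid, align_corners=True))
    return torch.cat(warped, dim=0).mean(dim=0)

In the actual model the aggregation over warped features would most likely be learned (for example with temporal attention) rather than a plain mean, and the cross-modal localization itself would additionally use the disentangled audio representation; the mean is used here only to keep the sketch short and self-contained.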
Paper link: http://arxiv.org/pdf/2303.13471v1
More computer science papers: http://cspaper.cn/